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Abstract 



In recent years, a rich variety of regularization procedures have been proposed for high dimensional 
ryj \ regression problems. However, tuning parameter choice and computational efficiency in ultra-high di- 

mensional problems remain vexing issues. The routine use of ij regularization is largely attributable to 
the computational efficiency of the LARS algorithm, but similar efficiency for better behaved penalties 
has remained elusive. In this article, we propose a highly efficient path following procedure for com- 
qq ■ bination of any convex loss function and a broad class of penalties. From a Bayesian perspective, this 

' algorithm rapidly yields maximum a posteriori estimates at different hyper-parameter values. To bypass 

iy-} , the inefficiency and potential instability of cross validation, we propose an empirical Bayes procedure for 

£C) ' rapidly choosing the optimal model and corresponding hyper-parameter value. This approach applies 

to any penalty that corresponds to a proper prior distribution on the regression coefficients. While we 
f ^ . mainly focus on sparse estimation of generalized linear models, the method extends to more general reg- 

' ularizations such as polynomial trend filtering after reparameterization. The proposed algorithm scales 

efficiently to large p and/or n. Solution paths of 10,000 dimensional examples are computed within one 
minute on a laptop for various generalized linear models (GLM). Operating characteristics are assessed 
through simulation studies and the methods are applied to several real data sets. 

, Keywords: Generalized linear model (GLM); Lasso; LARS; Maximum a posteriori estimation; Model 

selection; Non-convex penalty; Ordinary differential equation (ODE); Regularization; Solution path. 

1 Introduction 

Sparse estimation via regularization has become a prominent research area over the last decade fi nding inter 



est a cross a broad variety of disciplines. Much of the attention was brought by lasso regression ([Tibshirani 



1996), which is simply £1 regularizati on, where i„ is the 77-norm of a vector for r\ > 0. It was not until the 



introduction of the LARS algorithm ([Efron et al 



20041) that lasso became so routinely used. This popu- 



larity is attributable to the excellent computational performance of LARS, a variant of which obtains the 
whole solution path of the lasso at the cost of an ordinary least squares estimation. Unfortunately lasso has 
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literature analyzing lasso and its remedies 
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Knight and Fu, 


2000 


Fan and Li 


2001; 


Yuan and Lin 


2006 
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200G; 


Meinshausen and Biihlmann 


200G: 


Zou and Li 



Motivated by disadvantages of l\ regularization, some non-convex penalties are pro posed which , when 
designed properly, reduce bias in large signals while shrinking nois e -like signals to z ero ( Fan and Li 



Candes et al 



2008 



Friedman 



2008 



Armagan 



2009; 



Zhang , 



2010; 



Armagan et al 



2001 



20111 ). However, non- 



convex regularization involves difficult non-convex optimization. As a convex loss function plus concave 
penalties is a difference of two convex functions, an iterative algorithm for es timation at a fixed regularization 



2010T ) . At each iteration, 



parameter can be constructed by the majorization-minimization principle (jLange 
the penalty is replaced by the supporting hyperplane tangent at the current iterate. As the supporting 
hyperplane majorizes the concave penalty function, minimizing the convex loss plus the linear majorizing 
function (an l\ regularization problem) produces the next iterate, which is guaranteed to decrease the original 
penalized objective function . Many existing algor i thms for est i mation with concave penalties fall into this 



categ ory ([Fan and Li 



2001 



Hunter and Li 



2005 



Zou and Li 



2008 



Candes et al 



2008 



Armagan et al 



20111 ). Although being numerically stable and easy to implement, their (often slow) convergence to a local 
mode makes their performance quite sensitive to starting values in settings where p >> n. Coordinate descent 
is another algorithm for optimization in spar se regression at a fixed tuning parameter value and has found 



success in ultra-high dimens ional problems ([Friedman et al 



2010; 



Mazumder et al 



2007 



Wu and Lange 



2008 



Friedman et al 



20111) . Nevertheless the optimization has to be performed at a large number of grid 
points, making the computation rather demanding compared to path following algorithms such as LARS. 
For both algorithms the choice of grid points is tricky. When there are too few, important events along the 
path are missed; when there are too many, computation becomes expensive. 

The choice of tuning parameter is another important challenge in regularization problems. Cross- 
validation is widely used but incurs considerable computational costs. An attractive alternative is to se- 
lect th e tuning param eter according to a model selection criterio n such as the A kaike information criterion 



(AIC) ((Akaike 



19741 ) or the Bayesian information criterion (BIC) (jSchwarz , 



1978). These criteria choose the 



tuning parameter minimizing the negative log-likelihood penalized by the model d imension. In sh r inkage 



esti mation, how e ver, t he degrees of freedom is often unclear. Intriguing work by 



and 



Wang et al 



Wang and Lend (j2007l ) 



([20071 ) extend BIC to be used with shrin kage estimators. However , the meaning of BIC as 



an empirical Bayes procedure is lost in such extensions. 



Zhang and Huand (|2008l ) study the properties of 



generalized information criterion (GIC) in a similar context. 
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In this paper, we address these two issues for the general regularization problem 



jes 



(1) 



where / is a twice differentiable convex loss function, 5c{l,...,j)} indicates the subset of coefficients being 
penalized (S for shrink), and Prj(t,p) is a scalar penalty function. Here p is the penalty tuning parameter 
and r\ represents possible parameter (s) for a penalty family. Allowing a general penalization set S increases 
the applicability of the method, as we will see in Section |U Throughout this article we assume the following 
regularity conditions on the penalty function P^(t, p): (i) symmetric about in t, (ii) -P^(0, p) > — oo, for 
all p > 0, (iii) monotone increasing in p > for any fixed t, (iv) non-decreasing in t > for any fixed p, (v) 
first two derivatives with respect to t exist and are finite. 

The generality of (TTJ) is two-fold. First, / can be any convex loss function. For least squares problems, 
/(/3) = \\y — X(3\\%/2. For generalized linear models (GLMs), / is the negative log-likelihood function. 
For Gaussian graphical models, /(fi) = — logdet \Sl\ + tr(SO), where £ is a sample covariance matrix and 
the parameter f2 is a precision matrix. Secondly, most commonly used penalties satisfy the aforementioned 
assumptions. These include but are not limited to 



1. Power family (jFrank and Friedman 



1993J) 



Pn 



Ip) = p\P\\ 7?e(o,2]. 



Varying 77 from 2 bri dges best subset regressio n to lasso (7i) (jTibshirani 



ridge (1%) regression (jHoerl and Kennard . 



1970). 



1996 



Chen et al 



2. Elastic net ( Zou and Hastie 



20051) 



20011) to 



lp)=p[{r 1 -l)fi 2 /2 + {2-i 1 )\fi\], r, g [1,2]. 



Varying rj from 1 to 2 bridges lasso regression to ridge regression. 



3. Log penalty (jCandes et al 



2008 



Armagan et al 



2011) 



,p) = pln(r) + 



T]>0. 



This p enalty was called generalized elastic net in 



Friedman! (|2008l) and log penalty in 



Mazumder et al 



(|201ll ). Such a penalty is induc ed by a generali z ed Pa reto prior thresholded and folded at zero, with 



the oracle properties studied by 



Armagan et al 



(120111) 



4. Continuous log penalty (jArmagan et al 



2011) 



P(\0\,p)=p]n(Vp- 
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This version of log penalty was designed to guarantee 



matrix is scaled and orthogonal (jArmagan et al 



continuity of the solution path when the design 



2011). 



5. The SCAD penalty (|Fan and Li 



2001) is defined via its partial derivative 



p) = p\ hm< P } + 



(vp-\P\)- 



iW\>p} 



Integration shows SCAD as a natural quadratic spline with knots at p and rjp 

[p\P\ \P\<P 



i]>2. 



(2) 



,p 2 (77+l)/2 



> VP 



For small signals \j3\ < p, it acts as lasso; for larger signals |/3| > r/p, the penalty flattens and leads to 
the unbiasedness of the regularized estimate. 



6. Similar to SCAD is the MC+ penalty (jZhangl . 120101 ) defined by the partial derivative 



p) = p 1 - ^ 

PV/ + 



Integration shows that the penalty function 



1 



2r, J -^K"} ' 2 

is quadratic on [0, prf\ and flattens beyond pr\. Varying 77 from to 00 bridges hard thresholding (£ 
regression) to lasso (£1) shrinkage. 

The derivatives of penalty functions will be frequently used for the development of the path algorithm and 
model selection procedure. They are listed in Table [5] of Supplementary Materials for convenience. 
Our contributions are summarized as follows: 

1. We propose a general path seeking strategy for the sparse regression framework |T]). To the best of 
our knowledge, no previous work exists at this generality, e xcept the generalized path seeking (GPS) 



P 2 V 



{\P\>pv}> 



V > 0, 



(3) 



algorithm proposed in unpublished work by 



Friedman! (|2008l ). Some problems with the GPS algorithm 



motivated us to develop a more rigorous algorithm, which is fundamentally different from GPS. Path 
algorithms for some specific com binations of loss and pe nalty functions ha ve been studied before. 



Homotopy (jOsborne et al 



2004) co mpute the piecewise 



Park and Hastid (|2007h p roposed an 



20001) and a variant of LARS (lEfron et al 
linear solution path of £\ penalized linear regression efficiently 

approximate path algorithm for £1 penalized GLMs. A similar problem was considered by 
who devises a LARS algorithm for GLMs based on ordinary differential equations (ODEs). The ODE 
approach naturally fits problems with piecewise smooth solution paths and is the strategy we adopt 



Wul (120111) 
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in this paper. All of the aforementioned work deals with t\ regularization which leads to convex 
optimization problems. Moving from convex to non-convex penalties improves the quality of the 



2010) 



estimates but imposes great difficulties in computation. The PLUS path algorithm of (| Zhang 
is able to track all local minima; however, it is specifically designed for the least squares problem with 
an MC+ penalty and does not generalize to ([1]). 

2. We propose an empirical Baycs procedure for the selection of a good model and the implied hy- 
per/tuning parameters along the solution path. This method applies to any likelihood model with a 
penalty that corresponds to a proper shrinkage prior in the Bayesian setting. We illustrate the method 
with the power family (bridge) and log penalties which are induced by the exponential power and 
generalized double Pareto priors respectively. The regularization procedure resulting from the corre- 
sponding penalties is utilized as a model-search engine where each model and estimate a long t he path 



is appropriately evaluated by a criterion emerging from the prior used. 



Yuan and Linl (|2005f ) took a 



somewhat similar approach in the limited setting of l\ penalized linear regression. 

3. The proposed path algorithm and empirical Bayes model selection procedure extend to a large class of 
generalized regularization problems such a s polynomial trend filter i ng. P ath a lgorithms for generalized 



regularization was recen tly studied by 



for linear regression and by 



Zhou and Wu 



Tibs hirani and Tavlorl (|201ll) and 



Zhou and Langel pOllI ) 



(|201 If) for general convex loss functions. Using non-convex 
penalties in these general regularization problems produces more parsimonious and less biased esti- 
mates. Re-parameterization reformulates these problems as in (JlJ which is solved by our efficient path 
algorithm. 

4. A Matlab toolbox for sparse regression is released on the first author's web site. The code for 
all examples in this paper is available on the same web site, observing the principle of reproducible 
research. 

The remainder of the article is organized as follows. The path following algorithm is derived in Section 
[2J The empirical Bayes criterion is developed in Section [3] and illustrated for the power family and log 
penalties. Section 2] discusses extensions to generalized regularization problems. Various numerical examples 
are presented in Section [5j Finally we conclude with a discussion and future directions. 

2 Path Following for Sparse Regressions 

For a parameter vector (3 £ MP, we use Sz(/3) = {j E S : (3j = 0} to denote the set of penalized parameters 
that are zero and correspondingly 5g(/3) = {j £ <S : j3j ^ 0} is the set of nonzero penalized parameters. 
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A = S U <Sg indexes the current active predictors with unpenalized or nonzero penalized coefficients. It is 
convenient to define the Hessian, H_^(f3,p) G IRj- 4 !*!-^, of the penalized objective function (JXJ) restricted to 
the active predictors with entries 

Our path following algorithm revolves around the necessary condition for a local minimum. We denote the 
penalized objective function (Q]) by h((3) throughout this article. 

Lemma 2.1 (Necessary optimality condition). If (3 is a local minimum of ^j) at tuning parameter value p, 
then (3 satisfies the stationarity condition 



^ , 9P(\/3j\ >P ) 
if (® + — WW7\ — "Auesy = 0, 

where the coefficients ojj satisfy 



Vi/09) + ^ r J Uj l {j€S} = 0, j = ,p, (5) 



'{-I} ft <0 
^■€^[-1,1] ft- =0. 
{1} ft- >0 

Furthermore, Hj,((3,p) is positive semidefinite. 

Proof. When the penalty funct ion P is convex, th is is simply the first order optimality condition for uncon- 



strained convex minimization ( Ruszczvnski , 



20061 Theorem 3.5). When P is non-convex, we consider the 
optimality condition coordinate- wise. For j g {j : (3j ^ 0}, this is trivial. When ft = 0, (3j being a local 
minimum implies that the two directional derivatives are non-negative. Then 

d-.,h(/3) - ta W + . -V,/(ffl + > o, 

which is equivalent to ([5]) with ujj G [—1,1]. Positive semidefiniteness of iJ_4 follows from the second order 
necessary optimality condition when restricted to coordinates in A. □ 

We call any (3 satisfying ([5]) a stationary point at p. Our path algorithm tracks a stationary point along 
the path while sliding p from infinity towards zero. When the penalized objective function h is convex, e.g., 
•q G [1,2] regime of the power family, elastic net, or d 2 h is positive semidefinite, the stationarity condition 
(|S|) is sufficient for a global minimum. When h is not convex, the minimization problem is both non-smooth 
and non-convex and there lacks an easy-to-check sufficient condition for optimality. The most we can claim 
is that the directional derivatives at any stationary point are non-negative. 



G 



Lemma 2.2. Suppose (3 satisfies the stationarity condition 0). Then all directional derivatives at (3 are 
non-negative, i.e., 



KH> 40 t 



(6) 



for any i)£l p . Furthermore, if the penalized objective function h is convex, then (3 is a global minimum. 
Proof. By definition of directional derivative and the stationarity condition (|5|), 

dP v (t,p) 



d v h((3) = d v f([3)+ Yl ^^4^ sgn(&)+ E \»i 

dP v {t,p) 



dt 



= EN (sgnfe) • V,/(/3) + 
HA V 
0. 



ot 



sgn(ft-)+ E M 

i= lft'l iG5:/3 J= 



t=0 

0P n (i,p) 



0f 



t=o 



Consider the scalar function = h(f3 + tv). Convexity of h implies that g is convex too. Then the chord 
[g(t) - g(Q)]/t = [h((3 + tv) - h(f3)]/t is increasing for t > 0. Thus h((3 + v) - h(/3) > d v h(/3) > for all v, 
verifying that [3 is a global minimum. □ 

We remark that, without convexity, non-negativeness of a ll direc t ional derivatives does not guarantee 



2004 Exercise 1.12). Consider the 



local minimality, as demonstrated in the following example (|Lange . 
bivariate function f(x,y) = (y — x 2 )(y — 2a; 2 ). Any directional derivative at origin (0,0) is non-negative 
since \im.t^.o[f(ht, kt) — f(0, 0)]/t = for any h,k £ R and indeed, (0,0) is a local minimum along any line 
passing through it. However (0,0) is not a local minimum for / as it is easy to see that f(t, ct 2 ) < for any 
1 < c < 2, t 7^ 0, and that f(t,ct 2 ) > for any c < 1 or c > 2, t ^ 0. Figure [T] demonstrates how we go 
down hill along the parabola y = lAx 2 as we move away from (0,0). In this article, we abuse terminology 
by the use of solution path and in fact mean path of stationarity points. 

2.1 Least squares with orthogonal design 

Before deriving the path algorithm for the general sparse regression problem ([!}, we first investigate a 
simple case: linear regression with orthogonal design, i.e., f{j3) = \\y — X[3\\\/2 where x^x^ = || 1 1 i 1 {j =fe> • 
This serves two purposes. First it illustrates the difficulties (discontinuities, local minima) of path seeking 
with non-convex penalties. Secondly, the thresholding opera t or for orthogon a l desi g n is the building block 



of the coordinat e desc ent algorithm ([Friedman et al 



Mazumder et al 



2007 



Wu and Lange . 



2008 



Friedman et al 



2010: 



20111) or iterative thresholding, which we rely on to detect the discontinuities in path 



following for the non-convex case. 
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x 

Figure 1: Contours of f(x,y) = (y — x 2 )(y — 2x 2 ) and the parabola y — 1.4a; 2 that passes through (0,0), 
which is not a local minimum although all directional derivatives at (0,0) are nonnegative. 

For linear regression with orthogonal design, the penalized objective function in ([I]) can be written in a 
component-wise fashion and the path solution is 



$j(p) = argmin^ -±(J3 - b 3 f + P v (\/3\,p) 
where aj — x*Xj = \\xj\\ 2 and bj = x^y/x^xj. The solution t o Q for some popular 



Supplementary Materials. Similar derivations can be found in ([Mazumder et al 



(7) 



penalties is listed in 



20111 ). Existing literature 



mostly assumes that the design matrix is standardized, i.e., dj — 1. As we argue in forthcoming examples, 
in many applications it is prudent not to do so. Figure [2] depicts the evolution of the penalized objective 
function with varying p and the solution path for aj = b 3 ■ = 1 and the log penalty P,,(|/?|, p) — pln(|/3| + r/) 
with rj — 0.1. At p — 0.2191, the path solution jumps from a local minimum to the other positive local 



minimum. 



For the least squares problem with a non-orthogonal design, the coordinate descent algorithm iteratively 
updates /3j by ([7]). When updating j3j keeping other predictors fixed, the objective function takes the same 
format with aj = \\xjW 2 , and bj — Xj(y — X-j/3_j)/xjXj, where -X"-j and (3_j denote the design matrix 
and regression coefficient vector without the j-th covariate. For a general twice differentiable loss function 
/, we approximate the smooth part / by its Taylor expansion around current iterate (3^ 

f(J3) « /(/3 (t) ) + df(f3M)(f3 - 0®) + l -{(3- (3 (t ^ t d 2 f(f3^)(f3 - f}®) 



and then apply thresholding formula (J7J for j3j sequentially to obtain the next iterate f3 



(t+i) 
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0.21910.25 0.3025 
P 



Figure 2: Log penalty with orthogonal design: a = 1, b = 1, r] = 0.1. Left: Graphs of the penalized 
objective function a(/3 — b) 2 /2 + p\xi{rj + |/J|) at different p. Right: Solution path. Here abrj = 0.1 and 
a{q + \b\) 2 /4 = 0.3025. Discontinuity occurs somewhere between these two numbers. 

2.2 Path following for the general case 

The preceding discussion illustrates two difficulties with path seeking in sparse regression using non-convex 
penalties. First the solution path may not be continuous. Discontinuities occur when predictors enter the 
model at a non-zero magnitude or vice versa. This is caused by jumping between local minima. Second, in 
contrast to lasso, the solution path is no longer piecewise linear. This prohibits making giant jumps along 
the path like LARS. 

One strategy is to optimize (p} at a grid of p enalty intensity p using coordinate descent. This has found 



great success with lasso an d elastic net regre s sion ( Friedman et al 



2010)). The recent article ([Mazumder et al 



2007 



Wu and Lange . 



2008 



Friedman et al 



20111 ) explores the coordinate descent strategy with non-convex 
penalties. In principle it involves applying the thresholding formula to individual regression coefficients until 
convergence. However, determining the grid size for tuning parameter p in advance could be tricky. The 
larger the grid size the more likely we are to miss important events along the path, while the smaller the 
grid size the higher the computational costs. 

In this section we devise a path seeking strategy that tracks the solution path smoothly while allowing 
abrupt jumping (due to discontinuities) between segments. The key observation is that each path segment 
is smooth and satisfies a simple ordinary differential equation (ODE). Recall that the active set A = SUSg 
indexes all unpenalized and nonzero penalized coefficients and if _4 is the Hessian of the penalized objective 
function restricted to parameters in A. 



Proposition 2.3. The solution path f3(p) is continuous and differentiable at p if H^(/3,p) is positive 
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definite. Moreover the solution vector (3(p) satisfies 

^APl = -H A 1 (f3,p)-u A ((3,p), (8) 

where the matrix H A *s defined by ^ and the vector u A (/3) has entries 

( d 2 P,{\fi 3 \,p) ( a \ ■ p o 

[0 jeS 

Proof. Write the stationarity condition ([3]) for active predictors as a v e ctor e quation k(/3 A ,p) = 0. To solve 



for [3 A in terms of p, we apply the implicit function theorem (jLange , 



2004). This requires calculating the 



differential of k with respect to the dependent variables (3 A and the independent variable p 

dp A k{f3 A ,p) = H A {f3,p) 
d p k([3 A ,p) = u A (f3). 

Given the non-singularity of H A (f3,p), the implicit function theorem applies and shows the continuity and 
differentiability of (3 A (p) at p. Furthermore, it supplies the derivative <JSj> - □ 



Proposition 12.81 suggests that solving the simple ODE segment by segment is a promising path following 
strategy. However, the potential discontinuity along the path caused by the non-convex penalty has to be 
taken care of. Note that the stationarity condition for inactive predictors implies 



V,/(/3) 



jeS 2 



3 w\ p m, P y 

and provides one indicator when the coefficient should escape to during path following. However, due to 
the discontinuity, a regression coefficient /3j, j £ Sz, may escape with ujj in the interior of (-1,1). A more 
reliable implementation should check whether an inactive regression coefficient (3j becomes nonzero using 
the thresholding formulae at each step of path following. Another complication that discontinuity causes is 
that occasionally the active set Sz may change abruptly along the path especially when predictors are highly 
correlated. Therefore whenever a discontinuity is detected, it is advisable to use any nonsmooth optimizer, 
e.g., coordinate descent, to figure out the set configuration and starting point for the next segment. We 
pick up coordinate descent due to its simple implementation. Our path following strategy is summarized in 
Algorithm [1] 

Several remarks on Algorithm [T] are relevant here. 

Remark 2.4 (Path following direction). The ODE (0) is written in the usual sense and gives the derivative 
as p increases. In sparse regression, we solve in the reverse direction and shall take the opposite sign. 
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Algorithm 1 Path following for sparse regression. 



Determine the first penalized predictor j* to enter model and the corresponding p max 
Initialize S z = {j*}, Sg=S\ {j*}, and (3(p max ) = axgmin ( g s=0 /( i 9) 
repeat 
Solve ODE 

*P^M = -H A 03,p)- l u A (p,p) 

until (1) an active penalized predictor 0j, j € 5g, becomes 0, or (2) an inactive penalized coefficient 
Wj, j £ Sz, hits 1 or -1, or (3) an inactive penalized predictor Bj, j € Sz, jumps from to a nonzero 
minimum, or (4) the matrix H A (f3,p) becomes singular, 
if (1) or (2) then 

Update sets Sz and Sz 
else 

Apply coordinate descent at current p to determine Sz, Sz and /3_4 for next segment 
end if 

until termination criterion is met 



Remark 2.5 (Termination criterion). Termination criterion for path following may depend on the specific 
likelihood model. For linear regression, path seeking stops when the number of active predictors exceeds the 
rank of design matrix \A\ > rank(JT). The situation is more subtle for logistic or Poisson log-linear models 
due to separation. In binary logistic regression, complete separation occurs when there exists a vector zel p 
such that x\z > for all yi = 1 and x\z < for all yi — 0. When complete separation happens, the 
log-likelihood is unbounded and the MLE occurs at infinity along the direction z. The log-likelihood surface 
behaves linearly along this direction and dominates many non-convex penalties such as power, log, MC+, 
and SCAD, which is almost fiat at infinity. This implies that the penalized estimate also occurs at infinity. 
Path seeking should terminate whenever separation is detected, which may happen when \A\ is much smaller 
than the rank of the design matrix in large p small n problems. Separation occurs in the Poisson log-linear 
model too. Our implementation also allows users to input the maximum number of selected predictors until 
path seeking stops, which is convenient for exploratory analysis of ultra-high dimensional data. 

Remark 2.6 (Computational Complexity and Implementation). Any ODE solver repeatedly evaluates the 
derivative ([5]). The path segment stopping events (l)-(4) are checked during each derivative evaluation. 
Since the Hessian restricted to the active predictors is always positive semidefinite and the inactive penalized 
predictors are checked by thresholding, the qualit y of solution along the path is as good as any fixed tuning 



parameter optimizer such as coordinate descent iMazumder et al 



201m ). Computational complexity of Al- 



gorithm [7] depends on the loss function, number of smooth path segments, and the method for solving the 
ODE. Evaluating derivative ([H]) takes 0{n\A\ 2 ) flops for calculating the Hessian of a GLM loss I and takes 
0(|.4| 3 ) flops for solving the linear system. Detecting jumps of inactive penalized predictor by thresholding 
takes 0(\A\) flops. The cost of 0{\A\^) per gradient evaluation is not as daunting as it appears. Suppose any 
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fixed tuning parameter optimizer is utilized for path following with a warm start. When at a new p, assum- 
ing that the active set A is known, even the fastest Newton's method needs to solve the same linear system 
multiple times until convergence. The efficiency of Algorithm{l\lies in the fact that no iterations are needed 
at any p and it adaptively chooses step sizes to catch all events along the path. Algorithm [7] is extremely 
simple to implement using s oftware with a reliabl e ODE solver such as the ode45 function in Matlab and 



the deSolve package in R tSoetaert et 



201(a ). For instance, the rich numerical resources of Matlab 
include differential equation solvers that alert the user when certain events such as those stopping rules in 
Algorithm]]] are fulfilled. 

Remark 2.7 (Knots in SCAD and MC+). Solving the ODE {3p requires the second order partial derivatives 
0?p i i P(\/3\, p) and o \ p \ o p P{\P\: p) of the penalty functions, which are listed in Tabled Due to their designs, 
these partial derivatives are undetermined for SCAD and MC+ penalties at the knots: {p, rjp} for SCAD 
and {rjp} for MC+. However only the directional derivatives are needed, which are well-defined. Specifically 
we use Q?m'i P(\P\i P-) an d gpTg^ -^'G/^li P—)- ^ n practice, the ODE solver rarely steps on these knots exactly 
due to numerical precision. 

Finally, switching the role of p and 77, the same argument leads to an analogous result for path following 
in the penalty parameter rj with a fixed regularization parameter p. In this article we focus on path following 
in p with fixed rj in the usual sense. Implications of the next result will be investigated in future work. 

Proposition 2.8 (Path following in 77). Suppose the partial derivative 9J Q t g^ exists at allt>0 and p. For 
fixed p, the solution path (3{tj) is continuous and differ entiable at rj if H a{(3, 77) is positive definite. Moreover 
the solution vector f3(rj) satisfies 

= -H A \(3, v ).u A (f3,v), 
where the matrix Ha * s defined by Q) and the vector UjXfi) has entries 

u J {f3.rj) = \ W S9n{ ^ } j€Sz . 

3 Empirical Bayes Model Selection 

In practice, the regularization parameter p in sparse regression is tuned according to certain criteria. Often 
we wish to avoid cross-validation and rely on more efficient procedures. AIC, BIC and similar variants have 
frequently been used. Recall that BIC arises from a Laplace approximation to the log-marginal density of the 
observations under a Bayesian model. The priors on the parameters are specifically chosen to be normal with 
mean set at the maximum likelihood estimator and covariance that conveys the Fisher information observed 

12 



from one observation. This allows for a rather diffuse prior relative to the likelihood. Hence the resulting 
maximum a posteriori estimate is the maximum likelihood estimator. Often users plug in the estimates from 
sparse regression into AIC or BIC to assess the quality of the estimate/model. In this section we derive an 
appropriate empirical Bayes criterion that corresponds to the exact prior under which we are operating. All 
necessary components for calculating the empirical Ba yes criterion fa l l out nicely from the path following 
algorithm. A somewhat similar approach was taken by lYuan and Linl (|2005f ) to pick an appropriate tuning 



parameter for the lasso penalized least squares noting that the underlying Bayesian model is formed with a 
mixture prior - a spike at zero and a double exponential distribution on /3j S R. 

Conforming to previous notation, a model is represented by the active set A = SUSz which includes both 
un-penalized and selected penalized regression coefficients. By Bayes formula, the probability of a model A 
given data y is 

P (A\y) = ^M^. 

p{y) 

Assuming equal prior probability for all models, an appropriate Bayesian criterion for model comparison is 
the marginal data likelihood p(y\A) of model A. If the penalty in the penalized regression is induced by a 
proper prior n((3) on the regression coefficients, the marginal likelihood is calculated as 

p(y\A) = J n(f3 A , y) d(3 A = J n(y\f3 A ) ]J Tr(ft-) d(3 A . (9) 

jeA 

In most cases the integral cannot be analytically calculated. Fortunately the Laplace approximation is a 
viable choice in a similar manner to BIC, in which 7r(/3j) is taken as the vaguely informative unit information 
prior. We illustrate this general procedure with the log and power penalties. Note that both the regularization 
parameter p and penalty parameter r\ are treated as hyper-parameters in priors. Thus the procedure not 
only allows comparison of models along the path with fixed r\ but also models with distinct r\. 

3.1 Log penalty 



The log penalty arises from a generalized double Pareto prior (jArmagan et al 



20111 ) on regression coefficients 



7r(/3| a ,r,) = ^-(|/3|+ry)-(« +1 ), a, V > 0. 

Writing p = a + 1 and placing a generalized double Pareto prior on /3j for active coefficients j 6 A estimated 
at (/J, rf) , the un- normalized posterior is given by 

-1W-1V 



7r(P A ,y\p,Ti) - | (P ^ } expj^)-pgln(ry + |^|) 

( P IV- 1 1 q 



jeA 

exp{-h(P A )} , 
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where q = \A\ and h((3 A ) = —£(/3 A ) + p + Then a Laplace approximation to the integral 

([9]) enables us to assess the relative quality of an estimated model A at particular hyper/tuning parameter 
values (p, rj) by EB(p,?7) = — lnp(y|.4). The following result displays the empirical Bayes criterion for the 
log penalty and then specializes to the least squares case which involves unknown variance. Note that, by 
the definition of double Pareto prior, p > 1. Otherwise the prior on f3j is no longer proper. 

Proposition 3.1. For p > 1, an empirical Bayes criterion for the log penalized regression is 

EBi og (p, rj) = -gin | (|) V2 (p - l)^ 1 j + h([3) + \ logdet H A {P), 

where (3 is the path solution at (p, 77), A = A((3) is the set of active regression coefficients, q — \A\, and Ha 
is the restricted Hessian defined by |^). Under a linear model, it becomes 

EB logM . -qlJ (f!) V2 (A _ x) rffA + + \ logdet H A 0), 



whe 



-2 ■ J n-q 2 qp ( p \ h(0) 
a = argmm cr 2 < — In er H ^ lm] + q In y — - I J — 



Proof. The Laplace approximation to the normalizing constant © is given by 

\np(y\A) « In tt(/3 1, y\p, rf) + J ln(27r) - - log det d 2 h{[3 A 



where = argmhig^ /i(/3^) = argmin^ - £(fi A ) + p E je ^ ln (»7 + Ift'l) and [^M/^bfc = [-^(/^bfc + 
p(r/ + I I ) 2 l{j = fc} for j, k E A. Then the empirical Bayes criterion is 

EB(p,77) = -lnp(y\A) 

fa gin 2 - g(p - 1) In?/ - gln(p - 1) + mmh((3j) - ^ ln(27r) + ^ logdetd 2 /i(/3^) 

= -g In j (|) V2 (p- IV" 1 } + ^11!^) + ^ log det d 2 /!^). 

Now consider the linear model with unknown variance a 2 , 

2N-n/2 ( ^ff^ "' 



\y - X A f3 A \\l + 2a 2 (a + 1) £ lnfa + \(3 3 , 



X6XP 2^ 



(27T(7 ) 7 <^ - ^ expl-Zi^j/cr }, 



where p = er 2 (a + 1) and h([3 A ) = \\y — X aPaW^/^ + P^jeA + The Laplace approximation to 

the normalizing constant is then given by 

lnp(y\A,a 2 ) « ln^, yL4, <x 2 ) + | ln(27r) - ^logdet[a- 2 d 2 h(f3 A )}, 
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which suggests the empirical Bayes criterion 

EB(7?,p|<7 2 ) « ^y^ln(2^ 2 ) + 9^2-9(^-1)111^-^111(^-1 

+ ^4^ + Jlogdet^). 
o~ z 

Given .4, we can easily compute the value a 2 that minimizes the right-hand side 

~2 • f n -g i 2 9P, , / P ,\ , min^ fe(/3^) ^ 
0- = argmin CT 2 < — - — In a - In 77 - gin ^—^ - 1J H ^ r 



3.2 Power Family 

The power family penalty is induced by an exponential power prior on the regression coefficients 
The unnormalized posterior of the regression coefficients given a model A can be written as 



□ 



'3 \ v 



' 1 1 exp{-h((3 A )} (10) 



V2r(l/ry)^ 

Again the Laplace approximation to the posterior p(y\A) yields the following empirical Bayes criterion for 
power family penalized regression. 

Proposition 3.2. An empirical Bayes criterion for the power family penalized regression is 

EBpp(p, rf) = -gin J-^ + h((3) + - log dct H A ([3). 
For linear regression, it becomes 

EB PF (p, ,) = a In ^1 + f HZ£ - ^ / 1 + m , , 1 + \ log det ^ (/ 3) . 



V2T(l/ V ) V 2 r?y [ (n - q)/2 + q/r, J 2 

Proof. Given a certain model A observed at (p,rj), the Laplace approximation to the normalizing constant 
is given by 

\np(y\A) « ln7r(^,y|i) + | ln(27r) - 1 log dct d 2 ft (^) 

where ^ = argmin^/i^) = argmin^ - l{f3j) + pE je A l&l" and [ rf2/l (^)]ifc = + - 

l)|ftf'- 2 l {j = fc} for , 7 , fee A Then 

lnp(y|.4) « |ln(27r)-gln2 + glnr ? + -lnp-glnr(l/r7) 
z 77 



minM^-ilogdetd 2 ^^), 
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which yields 



EB(p, rj) = -q In ^^.\ + min h(fi Ji ) + \ log det d 2 h{{3^ 



Now consider the linear model case, 



7r(^,y|p,r?,a 2 ) = {2-ko 1 ) 



2\-n/2 



( w 



1/'/ 



V2cr 2 A)r(l/77) 



exp ■ 



l»-X^g/2 + pE„,|ft|i 



where h(f3^) = \\y — X^,f3^\ 2 ,/'^ + pX^e.A l/^ji 17, The Laplace approximation to the normalizing constant 
at an estimated model A is then given by 



\np(y\A,a 2 ) « ln^(^,y|A a 2 ) + | In(2ir) - ^ logdct[a- 2 d 2 ^(/3^)], 



where /3j[ = argming h((3j) and [d 2 /^/^)]./*; = a^tCfc + ^77(77 — 1) T' 2 l{j=/c} f° r j> k € A. Then 

x/^7P 1/?) 



lnp(y|„4, cr 



q In ■ 



v/2r(i/7?) 

Plugging in the maximizing a 2 



n-q q 
V 



- I In n :1 - !^Al _ I logdetd 2 /i(/3_4) - '- ln27r. 



n 
2 



a 2 



(n - g)/2 + g/r? 
and omitting the constant term (nln27r)/2, we obtain 



V2r(i/, 7 ) 



□ 



4 Further Applications 

The generality of ([l} invites numerous applications beyond variable selection. After reparameterization, 
many generalized regularization problems are subject to the path following and empirical Bayes model 
selection procedure developed in the previous two sections. In this section we briefly discuss some further 
applications. 



The recent articles ([Tibshirani and Taylor 



2011 



Zhou and Lange . 



2011 



Zhou and Wu 



20111) consider 



the generalized t\ regularization problem 

nnn/(/3)+p||F/3|| 1+ p||^/3|| + , 

where ||a||+ = ^^axjoijO} is the sum of positive parts of its components. The first regularization term 
enforces equality constraints among coefficients at large p while the second enforces inequality constraints. 
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Applications range from i\ penalized GLMs, shape restricted regressions, to nonparamctric density estima- 
tion. For more parsimonious and unbiased solutions, generalized sparse regularization can be proposed 

r s 

min f((3) + PWPIP) + E P +K"/3> P)> ( n ) 

i=l j=l 

where P is a non-convex penalty function (power, double Pareto, SCAD, MC+, etc.) and P+(t, p) = P(t, p) 
for t > and P(0,p) otherwise. Devising an efficient path algorithm for (fTTj) is hard in general. However, 
when {vi} and {it)j} are linearly independent, it can be readily solved by our path algorithm via a simple 
reparameterization. For ease of presentation, we only consider equality regularization here. Let the matrix 
V G M rxp collect Vi in its rows. The assumption of full row rank of V implies r < p. The trick is to 
reparameterize (3 by 7 = V(3 where V E M. pxp is the matrix V appended with extra rows such that V 
has full column rank. Then original coefficients (3 can be recovered from the reparameterized ones 7 via 
(3 = (V t V)~ 1 V t 'j. The reparameterized regularization problem is given by 

mm /[( VV)- 1 VH + £ P(\lj\,p) (12) 

which is amenable to Algorithm [T] Note that / remains convex and twice diffcrentiable under affine trans- 
formation of variables. 

Regularization matrix V with full row rank appears in numerous applications. For fused lasso, the 
regularization matrix 

/ -1 1 



■1 1 



has full row rank. In polynomial trend filtering (jKim et al 



2009: 



Tibshirani and Taylor 



20111) , order d finite 



differences between successive regression coefficients arc penalized. Fused lasso corresponds to d = 1 and the 
general polynomial trend filtering invokes regularization matrix Vd — Vd—iVi, which again has full row 
rank. In Section [5. 31 cubic trend filtering for logistic regression is demonstrated on a financial data set. In all 
of these applications the regularization matrix V is highly sparse and structured. The back transformation 
(V t V)~ 1 V t ~f in (fT2"j) is cheap to compute using a pre-computed sparse Cholesky factor of V V. The design 
matrix in terms of variable 7 is X(V t V)~ 1 V 1 . In contrast to the regular variable selection problem, it shall 
not be assumed to be centered and scaled. 

5 Examples 

Various numerical examples in this section illustrate the path seeking algorithm and empirical Bayes model 
selection procedure developed in this article. The first two classical data sets show the mechanics of the 
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path following for linear and logistic regressions and compare the model fit and prediction performance 
under various penalties. The third example illustrates the application of path algorithm and empirical Bayes 
procedure to cubic trend filtering in logistic regression using a financial data set. The last simulation example 
evaluates the computational efficiency of the path algorithm in a large p small n setting. Run times are 
displayed whenever possible to indicate the efficiency of our path following algorithm. The algorithm is run 
on a laptop with Intel Core i7 M620 2.66GHz CPU and 8 GB RAM. For reproducibility the code for all 
examples is available on the first author's web site. 

5.1 Linear regression: Prostate cancer data 



2009). The response variable 



The first example concerns the classical prostate cancer data in (jHastie et al. 
is logarithm of prostate specific antigen (ipsa) and the seven predictors are the logarithm of cancer volume 
(lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion 
(svi), the logarithm of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 
4 or 5 (pgg45). The data set contains 97 observations and is split into a training set of size 67 and a test set 
of 30 observations. 

Figure [3] displays the solution paths of linear regression with nine representative penalties on the training 
set. Discontinuities occur in the paths from power family with r/ = 0.5, continuous log penalty, and the log 
penalty with r) = 1. In contrast, the lasso solution path, from cither cnct(l) or power (1), is continuous and 
piecewise linear. Figure [3] also illustrates the trade-off between continuity and unbiasedness. Using convex 
penalties, such as enet (77 = 1, 1.5, 2) and power (77 = 1), guarantees the continuity of solution path but causes 
bias in the estimates along the solution path. For a non-convex penalty such as power (77 = 0.5), estimates 
are approximately unbiased once selected. However this can only be achieved by allowing discontinuities 
along the path. 

To compare the model fit along the pat hs, it is more inf ormative to plot the explained variation versus 



model dimension along the solution paths ([Friedman 



20081 ). Upper panels of Figure 0] display such plots 



for the enet, power, and log penalties at various penalty parameter values 77. y-axis is the proportion 
R 2 (p)/R 2 (0), i.e., the R 2 from the path solutions scaled by the maximum explained variation i? 2 (0). Results 
for other penalties (MC+, SCAD) are not shown for brevity. Non-convex penalties show clear advantage in 
terms of higher explanatory power using fewer predictors. The model fit of path solutions in the test set 
shows similar patterns to those in Figure [5] To avoid repetition, they are not displayed here. 

To evaluate the prediction performance, the prediction mean squared errors (MSE) on the test set from 
the solution paths are shown in the lower panels of Figure 0J Different classes of penalties all achieve the 
best prediction error of 0.45 with 4-6 predictors. It is interesting to note the highly concave penalties such 
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Figure 3: Solution paths for the prostate cancer data. 

as power (77 = 0.2) do not achieve the best prediction error along the path. Lasso and moderately concave 
penalties are quite competitive in achieving the best prediction error along the path. Convex penalties like 
enet with 77 > 1 tend to admit too many predictors without achieving the best error rate. 

5.2 Logistic regression: South Africa heart disease data 

For demonstration of logistic regression, we again use the classical South Africa heart disease data set in 



(Hastieetal 



20091) . This data set has n = 462 observations measured on 7 predictors. The response variable 
is binary (heart disease or not). We split the data set into a training set with 312 data points and a test 
set with 150 data points. Solution paths are obtained for the training data set from various penalties and 
are displayed in Figure [5l Similar patterns are observed as those for the prostate cancer linear regression 
example. The discontinuities for concave penalties such as power (rj = 0.5) and log penalty (77 = 1) lead 
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power/enet, 1 .4s 



penalty, 1 .6s 
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Figure 4: Upper panels: R 2 vs model dimension from various penalties for the prostate cancer data. Lower 
panels: Prediction mean square error (MSE) vs model dimension from various penalties for the prostate 
cancer data. 



to less biased estimates along the paths. The plots of explained deviance versus model size for selected 
penalties are given in the upper panels of Figure [5] Solutions from concave penalties tend to explain more 
deviance with fewer predictors than lasso and enet with rj > 1 . Deviance plots for the test set show a similar 
pattern. Prediction power of the path solutions is evaluated on the test data set and the prediction MSEs 
are reported in the lower panels of Figure [51 The highly concave penalties such as power (77 < 1) and log 
penalty (77 = 0.1) are able to achieve the best prediction error rate 0.425 with 5 predictors. Convex penalties 
and less concave ones perform worse in prediction power, even with more than 5 predictors. 

5.3 Logistic regression with cubic trend filtering: M&A data 

The third example illustrates generalized regularization with logistic regression. We consider a merger and 



acquisition (M&A) data set studied in recent articles (jZhou and Wu 



2011 



Fan et al 



2011 ). This data 



set constitutes n = 1,371 US companies with a binary response variable indicating whether the company 
becomes a leveraged buyout (LBO) target (y,; = 1) or not (yi — 0). Seven covariates are recorded for each 



20 



enet(1), 0.3s enet(1.5), 0.2s enet(2), 0.07s 




20 40 50 100 200 400 600 



power(0.5), 0.2s power(1), 0.2s log(0), 0.3s 




10 20 20 40 1000 2000 3000 



log(1), 0.5s mcp(0.1), 0.2s scad(3.7), 0.2s 




10 20 30 40 50 20 40 20 40 

P 



Figure 5: Solution paths for the South Africa heart disease data. 

company. Tabled] lists the 7 predictors and their p- values in the classical linear logistic regression. Predictors 
'long term investment', 'log market equity', and 'return S&P 500 index' show no significance while the finance 
theory indicates otherwise. 



Predictor 


p- value 


Cash Flow 


0.0019 


Case 


0.0211 


Long Term Investment 


0.5593 


Market to Book Ratio 


0.0000 


Log Market Equity 


0.5099 


Tax 


0.0358 


Return S&P 500 Index 


0.2514 



Table 1: Predictors and their p- values from the linear logistic regression. 
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Figure 6: Upper panels: Negative deviance vs model dimension from various penalties for the South Africa 
heart disease data. Lower panels: Prediction mean square error (MSE) vs model dimension from various 
penalties for the South Africa heart disease data. 

To explore the possibly nonlinear effects of these quantitative covariates, we discretize each predictor into, 
say, 10 bins and fit a logistic regression. The first bin of each predictor is used as the reference level and effect 
coding is applied to each discretized covariate. The circles (o) in Figure |5] denote the estimated coefficients 
for each bin of each predictor and hint at some interesting nonlinear effects. For instance, the chance of being 
an LBO target seems to monotonically decease with market-to-book ratio and be quadratic as a function of 
log ma rket equity. Regulari zation can be utilized to borrow strength between neighboring bins. The recent 



paper IjZhou and Wu 



20111 ) applies cubic trend filtering to the 7 covariates using i\ regularization. Here 
we demonstrate a similar regularization using a non-convex penalty. Specifically, we minimize a regularized 
negative logistic log-likelihood of the form 

7 

where (3 ■ is the vector of regression coefficients for the j-th discretized covariate. The matrices in the 
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regularization terms are specified as 



/ -1 
1 



2 -1 
-4 6 
1 -4 

1 



1 

-4 



1 



V 



L 6-4 1 

-12-1/ 

which penalizes the fourth order finite differences between the bin estimates. Thus, as p increases, the coeffi- 
cient vectors for each covariate tend to be piecewise cubic with two ends being linear , mimicking the natural 



cubic spline. This is one example of polynomial trend filtering ([Kim et al 



2009: 



Tibshirani and Taylor 



20111 ) applied to logistic regression. Similar to semi-parametric regressions, regularizations in polynomial 
trend filtering 'let the data speak for themselves'. In contrast, the bandwidth selection in semi-parametric 
regression is replaced by parameter tuning in regularizations. The number and locations of knots are auto- 
matically determined by the tuning parameter which is chosen according to a model selection criterion. The 
left panel of Figure [7] displayed the solution path delivered by the power penalty with 77 = 0.5. It bridges 
the unconstrained estimates (denoted by o) to the constrained estimates (denoted by +). The right panel of 
Figure [7] shows the empirical Bayes criterion along the path. The dotted line in Figure |S] is the solution with 
smallest empirical Bayes criterion. It mostly matches the fully regularized solution except a small 'dip' in 
the middle range of 'log market equity'. The classical linear logistic regression corresponds to the restricted 
model where all bins for a covariate coincide. A formal analysis of deviance indicates that the regularized 
model at p = 2.5629 is significant with respect to this null model with p-value 0.0023. 

The quadratic or cubic like pattern in the effects of predictors 'long term investment', 'log market equity', 
and 'return S&P 500 index' revealed by the regularized estimates explain why they are missed by the classical 
linear logistic regression. These patterns match some existing finance theories. For instance, Log of market 
equity is a measure of company size. Smaller companies are unpredictable in their profitability and extremely 
large companies are unlikely to be an LBO target because LBOs are typically financed with a large proportion 
of external debt. A company with a low cash flow is unlikely to be an LBO target because low cash flow 
is hard to meet the heavy debt burden associated with the LBO. On the other hand, a company carrying 
a high cash flow is likely to possess a new technology. It is risky to acquire such firms because it is hard 
to predict their profitability. The tax reason is obvious from the regularized estimates. The more tax the 
company is paying, the more tax benefits from an LBO. 

5.4 GLM sparse regression: large p, small n 

In all of the previous examples, the number of observations n exceeds the number of parameters p. Our final 
simulation example evaluates the performance of the path following algorithm and empirical Bayes procedure 
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Figure 7: Regularized logistic regression on the M&A data. Left: The trajectory of solution path from power 
penalty with r\ = 0.5. Right: Empirical Bayes criterion along the path. Vertical lines indicate the model 
selected by the empirical Bayes criterion. 

in a large p small n setup for various generalized linear models (GLM). In each simulation replicate, n = 200 
independent responses yi are simulated from a normal (with unit variance) , Poisson, and binomial distribution 
with mean fii respectively. The mean \ii is determined by a p = 10, 000 dimensional covariate Xi through a 
link function g(fJ-i) = a + xjft where a is the intercept. Canonical links are used in the simulations. For the 
linear model, <?(/i) = /i. For the Poisson model, <?(/i) = In/i. For the logistic model, g(/j,) = log[/i/(l — fi)]. 

Numerous settings can be explored in this framework. For brevity and reproducibility, we only display 
the results for a simple exemplary setup: entries of covariate Xi are generated from iid standard normal 
and the true regression coefficients are /3i =3 for i = 1, . . . , 5, (3i — —3 for i = 6, . . . , 10, and a = fii = 
for i — 11, . . . , 10, 000. Results presented in Figure IMT21 are based on 100 simulation replicates. In each 
replicate, path following is carried out under linear, Poisson, and logistic regression models coupled with 
power penalties at r\ = 0.25, 0.5, 0.75, 1, representing a spectrum from (nearly) best subset regression to lasso 
regression. Results for other penalties are not shown to save space. Path following is terminated when at 
least 100 predictors are selected or separation is detected in the Poisson or logistic models, whichever occurs 
first. Results for the logistic sparse regression have to be interpreted with caution due to frequent occurrence 
of complete separation along solution paths. This is common in large p small n problems as the chance of 
finding a linear combination of a few predictors that perfectly predicts the n — 200 binary responses is very 
high when there are p = 10, 000 candidate covariates. Therefore the results for logistic regression largely 
reflect the quality of solutions when path following terminates at complete separation. 

Figure [9] displays the boxplots of run times at different combinations of the GLM model and penalty 
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Figure 8: Snapshots of the path solution to the regularized logistic regression on the M&A data set. The 
best model according to empirical Bayes criterion (dotted line) most matches the fully regularized solution 
(line with crosses) except that it has a dip in the middle part of the 'log market equity' variable. 



value 77. The majority of runs take less than one minute across all models, except Poisson regression with 
i] = 0.75. The run times in this setting display large variability with a median around 50 seconds. Logistic 
regression with non-convex penalties (77 = 0.25, 0.5, 0, 75) has shorter run times than lasso penalty (77 = 1) 
due to complete separation at early stages of path following. 

Figures [TU] and [TT] display the false positive rate (FPR) and false negative rate (FNR) of the model 
selected by the empirical Bayes procedure at different combinations of GLM model and penalty value •;/. 
FPR (type I error rate) is defined as the proportion of false positives in the selected model among all true 
negatives (9990 in this case). FNR (type II error rate) is defined as the proportion of false negatives in the 
selected model among all true positives (10 in this case). These two numbers give a rough measure of model 
selection performance. For all three GLM models, power penalties with larger 77 (close to convexity) tend 
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to select more predictors, leading to significantly higher FPR. On the other hand, the median FNR appears 
not significantly improved in larger rj cases, although they admit more predictors. This indicates the overall 
improved model selection performance of non-convex penalties. 

More interesting is the mean square error (MSE) of the parameter estimate (3 at the model selected by 
the empirical Bayes procedure. MSE is defined as EiO^j — fi) 2 /p} 1 ^ 2 - Figure [T2l shows that lasso (rj = 1) 
has risk properties comparable to the non-convex penalties, although it is a poor model selector in terms of 
FPR and FNR. 

We should keep in mind that these results are particular to the specific simulation setting we presented 
here and may vary across numerous factors such as pairwise correlations between the covariates, signal to 
noise ratio, sample size n and dimension p, penalty type, etc. We hope that the tools developed in this 
article facilitate such comparative studies. The generality of our method precludes extensive numerical 
comparison with current methods as only a few software packages are available for the special cases of 



In supplementary materials, we compare the run times of our algorithm to that of 
for the special case of lasso penalized GLM. 
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Figure 9: Run times of path following for GLM sparse regression with power penalties from 100 replicates. 
Problem size is n — 200 and p — 10, 000. 
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Figure 10: False positive rate (FPR) of the GLM sparse model selected by the empirical Bayes criterion. 
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Figure 11: False negative rate (FNR) of the GLM sparse model selected by the empirical Bayes criterion. 
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MSE of Emprical Bayes Model 
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Figure 12: Mean square error (MSE) of the parameter estimate from the model selected by the empirical 
Bayes criterion. 



6 Discussion 



In this article we propose a generic path following algorithm for any combination of a convex loss function 
and a penalty f unction that satisfies mild conditions. Although motivated by the unpublished work by 



Friedman! (|2008l ). our algorithm turns out to be different from his general path seeking (GPS) algorithm. 
Further research is needed on the connection between the two. The ODE approach for path following tracks 
the solution smoothly and avoids the need to choose a fixed step size as required by most currently available 
regularization path algorithms. 

Motivated by a shrinkage prior in the Bayesian setting, we derived an empirical Bayes procedure that 
allows quick search for a model and the corresponding tuning parameter along the solution paths from a 
large class of penalty functions. All necessary quantities for the empirical Bayes procedure naturally arise 
in the path algorithm. 

Besides sparse regression, simple reparameterization extends the applicability of the path algorithm to 
many more generalized regularization problems. The cubic trend filtering example with the M&A data 
illustrates the point. 

Our numerical examples illustrate the working mechanics of the path algorithm and properties of different 
penalties. A more extensive comparative study of the penalties in various situations is well deserved. The 
tools developed in this article free statisticians from the often time consuming task of developing optimization 
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algorithms for specific loss and penalty combination. Interested readers are welcome to use the SparseReg 
toolbox freely available on the first author's web site. 
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Supplementary Materials 

Derivatives of penalty functions 

First two derivatives of commonly used penalty functions are listed in Table [5J 

Penalty Function P„ , p) ^mM^ 

Power Family 77 £ (0,2] p|/3|" H/ 3 !"" 1 

Elastic Net, r, £ [1,2] p[(rj - l)p 2 /2 + (2 - r,)|/3|] - 1)|/3| + (2 - r?)] 

Log, 77 >0 pMv+\,8\) piv+lP])- 1 

Continuous Log p\n(y/p + |/3|) + |/3|) 

SCAD, 77 > 2 See© p {l m < p} + ^^ffi U\P\>p}} 

MC+, r? > See © p (l - 



ap? si/3! Sp 



Power Family r\ £ (0, 2] 


P 77 ( 77 -i)i/?r 2 


vwr 1 


Elastic Net, 77 g [1,2] 




(77-l)|/3| + (2 -77) 


Log, r] > 






Continuous Log 


-p(vp+\p\r 2 


(s/p+pr'-^vpi/p+iPir 2 

1 {\!3\<p} + _ !) 1:L {l/5|e[p^p)} 


SCAD, 77 > 2 


-(V- 1 ) ll {\0\€[p,vp)} 


MC+, 77 > 




hm<pv} 



Table 2: Some commonly used penalty functions and their derivatives. 



Thresholding Formula for Least Squares with Orthogonal Design 

We drop subscript j henceforth to prevent clutter. 

1. For the DP (77) penalty, the objective function becomes 

|(/3-6) 2 + pln(77+|/3|). 
Setting derivative to 0, we find that the optimal solution is given by 

, m-,)+m + f-± P /«V<\ gn{b) p e [0) \ aH] 

Kp)= <0or (|fc| ' T ' )+[(l ' ,| + T ' )2 ' 4p/all/2 sgn(6) p € {\abr)\,a(r) + \b\f/4) ■ 
[0 ' pe [a(r,- +|&|) 2 /4,oc) 

The ambiguous case reflects the difficulty with non-convex minimization and has to be resolved by 

comparing the objective function values at the two points. Moving from one local minimum at to 

the other non-zero one results in a jump somewhere in the interval [|<x6?7j , a(rj + |6|) 2 /4]. Note that at 

p = \abrj\, (3(p) = [(\b\ — rf) + \\b\ — r/W/2, which is zero when 77 > \b\. Therefore, the path is continuous 

whenever rj > 

2. For the continuous DP penalty, the objective function is 
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with the solution 

for p < a 2 6 2 . Note at p = a 2 6 2 , /3(p) = [(6 - ab) + \b — ab\]/2, which is when a > 1. Therefore, when 
a > 1, the solution path is continuous 

r MM^ s#) pe[0,a 2 fo 2 ] 
k p€ [a 2 & 2 ,oo) ' 

When a < 1, the solution path is given by 

/ 3 (pH { (|b| -^ )+[(|b|+ 2 ^ )2 - 4p/all/2 Sg nW Pe[0,p1 
\o pe[p*,oo) 

where p* > a 2 6 2 indicates the discontinuity point and shall be determined numerically. 

3. For power family the objective function is 

For rj e (0, 1), the solution (3(p) is the unique root of the estimating equation 

a(/3-b)+pr 1 \p\ r i- 1 sgn(p)=0 

or 0, whichever gives a smaller objective value. For 77 = 1, it reduces to the well-known soft thresholding 
operator for lasso /3(p) = median{±p/a + b, 0}. For the convex case 77 G (1,2], the solution /3(p) is 
always the (nonzero) root of the estimating equation, i.e., no thresholding; only shrinkage occurs. 

4. For elastic net, the objective function is 

|( / 3-6) 2 +p[(r ? -l)/? 2 /2 + (2 -77)1/31] 
with a continuous path solution 

ab ± p(2 - 77) 



/3(p) = median <^ 0, 

I a + p(r?-l) 

Again the lasso soft thresholding is recovered at r\ = 1 and ridge shrinkage is achieved at 77 = 2. 
5. For SCAD, the minimum of the penalized objective function over [0, p] is 

Pi(p) = sgn(6)min{r,ui}l r>0 , 
where r = (a\b\ — p)/a and u\ = min{p, and the minimum over [p, r\p\ is 

pl{2r>p+« 2 } + u 2^{2r<p+u 2 } a (V ~ 1) < 1 

(3 2 (p) = sgn(6) • 1 pl{r,p>\b\} + u 2 l{ w <|fe|} 0(77 - 1) = 1 , 

,pl{r<p} + rl {re[P;U2 ] } + u 2 l{ r> u 2 } a{r) - 1) > 1 
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where 1*2 = miri{r]p, \b\} and r = [ab(rj — 1) — rjp]/[a{rj — 1) — 1]. When |6| < p, the solution is /3i(p). 
When \b\ £ (p, ryp], the solution is either Pi(p) or ^(p), whichever gives the smaller penalized objective 
value. When |6| > r]p, the solution is /3i(p), /^(p) or /^(p) = whichever gives the smallest penalized 
objective value. 

6. For MC+, the path solution is either 

!b*l{2r<b*} ar) < 1 

b*l{ P <a\b\} 077 = 1, 

min{r,6*}l {r>0} a?7 > 1 

where b* — min{p?7, and r = — (p— a\b\)/(a — r/^ 1 ), or (3(p) = b, whichever gives a smaller penalized 
objective value. 

Numerical Comparisons 



Figur e [T3l displays the run times of lasso penalized GLM by the GLMNet package in R ([Friedman et al 



2010), which is the state-of-the-art method for calculating the solution paths of GLM model with enet 
penalties. It applies coordinate descent algorithm to a sequence of tuning parameters with warm start. The 
simulation setup is same as in Section [5.41 and the top 100 predictors are requested from the path algorithm 
using its default setting. In general GLMNet shows shorter run times than the ODE path algorithm (last 
column of Figure |9]). However a major difference is that GLMNet only computes the solution at a finite 
number of tuning parameter values (100 by default), while the ODE solver smoothly tracks the whole solution 
path. 
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GLMNet Run Time for Lasso 
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Figure 13: Run times of GLMNet for the lasso problems from 100 replicates. Problem size is n — 200 and 
p = 10, 000. Top 100 predictors are requested. 
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